Abstract

The problem of clustering is considered, for the case
when each data point is a sample generated by a sta- 
tionary ergodic process. We propose a very natural 
asymptotic notion of consistency, and show that sim- 
ple consistent algorithms exist, under most general 
non-parametric assumptions. The notion of consis- 
tency is as follows: two samples should be put into 
the same cluster if and only if they were generated 
by the same distribution. With this notion of consis- 
tency, clustering generalizes such classical statistical 
problems as homogeneity testing and process clas- 
sification. We show that, for the case of a known 
number of clusters, consistency can be achieved un- 
der the only assumption that the joint distribution 
of the data is stationary ergodic (no parametric or 
Markovian assumptions, no assumptions of indepen- 
dence, neither between nor within the samples). If 
the number of clusters is unknown, consistency can 
be achieved under appropriate assumptions on the 
mixing rates of the processes. In both cases we give 
examples of simple (at most quadratic in each argu- 
ment) algorithms which are consistent. 



1 Introduction 

Given a finite set of objects, the problem is to "clus- 
ter" similar objects together. This intuitively sim- 
ple goal is notoriously hard to formalize. Most of 
the work on clustering is concerned with particular 
parametric data generating models, or particular
algorithms, a given similarity measure, and (very
often) a given number of clusters. It is clear that,
as in almost all learning problems, in clustering finding
the right similarity measure is an integral part of the
problem. However, even if one assumes the similarity
measure known, it is hard to define what a good
clustering is (Kleinberg, 2002; Zadeh & Ben-David,
2009). What is more, even if one assumes the
similarity measure to be simply the Euclidean distance
(on the plane), and the number of clusters k known,
then clustering may still appear intractable for
computational reasons. Indeed, in this case finding k
centres (points which minimize the cumulative distance
from each point in the sample to one of the centres)
seems to be a natural goal, but this problem is
NP-hard (Mahajan et al., 2009).

In this work we concentrate on a subset of the 
clustering problem: clustering processes. That is, 
each data point is itself a sample generated by a cer- 
tain discrete-time stochastic process. This version 
of the problem has numerous applications, such as 
clustering biological data, financial observations, or 
behavioural patterns, and as such it has received
considerable attention in the literature.

The main observation that we make in this work 
is that, in the case of clustering processes, one can 
benefit from the notion of ergodicity to define what 
appears to be a very natural notion of consistency. 
This notion of consistency is shown to be satisfied by 
simple algorithms that we present, which are polyno- 
mial in all arguments. This can be achieved without 
any modeling assumptions on the data (e.g. Hidden 
Markov, Gaussian, etc.), without assuming indepen- 
dence of any kind within or between the samples. 
The only assumption that we make is that the joint 
distribution of the data is stationary ergodic. The 
assumption of stationarity means, intuitively, that 
the time index itself bears no information: it does
not matter whether we have started recording
observations at time 0 or at time 100. By virtue of the
ergodic decomposition theorem, any stationary process
can be represented as a mixture of stationary ergodic processes.
In other words, a stationary process can be thought of 
as first selecting a stationary ergodic process (accord- 
ing to some prior distribution) and then observing its 
outcomes. Thus, the assumption that the data is sta- 
tionary ergodic is both very natural and rather weak. 
At the same time, ergodicity means that,
asymptotically, the properties of the process can be
learned from observation.

This allows us to define the clustering problem
as follows. N samples are given:
$x_1 = (x^1_1, \dots, x^1_{n_1}), \dots, x_N = (x^N_1, \dots, x^N_{n_N})$.
Each sample
is drawn by one out of k different stationary ergodic
distributions. The samples are not assumed to be 
drawn independently; rather, it is assumed that the 
joint distribution of the samples is stationary ergodic. 
The target clustering is as follows: those and only 
those samples are put into the same cluster that were 
generated by the same distribution. The number k of 
target clusters can be either known or unknown (dif- 
ferent consistency results can be obtained in these 
cases). A clustering algorithm is called asymptotically
consistent if the probability that it outputs
the target clustering converges to 1, as the lengths
$(n_1, \dots, n_N)$ of the samples tend to infinity (a
variant of this definition is to require the algorithm to
stabilize on the correct answer with probability 1).
Note the particular asymptotic regime: not with
respect to the number of samples N, but with respect
to the lengths of the samples $n_1, \dots, n_N$.

Similar formulations have appeared in the literature
before. Perhaps the closest approach is mixture
models (Smyth, 1997; Zhong & Ghosh, 2003):
it is assumed that there are k different distributions
that have a particular known form (such as Gaussian,
Hidden Markov models, or graphical models)
and each one out of N samples is generated
independently according to one of these k distributions
(with some fixed probability). Since the model of the
data is specified quite well, one can use
likelihood-based distances (and then, for example, the k-means
algorithm), or Bayesian inference, to cluster the data.



Clearly, the main difference from our setting is in that 
we do not assume any known model of the data; not 
even between-sample independence is assumed. 

The problem of clustering in our formulation
generalizes two classical problems of mathematical
statistics. The first one is homogeneity testing, or the
two-sample problem. Two samples
$x_1 = (x^1_1, \dots, x^1_{n_1})$ and $x_2 = (x^2_1, \dots, x^2_{n_2})$
are given, and it is required
to test whether they were generated by the same
distribution, or by different distributions. This
corresponds to clustering just two data points (N = 2)
with the number k of clusters unknown: either k = 1
or k = 2. The second problem is process
classification, or the three-sample problem. Three samples
$x_1, x_2, x_3$ are given; it is known that two of them
were generated by the same distribution, while the
third one was generated by a different distribution.
It is required to find out which two were generated
by the same distribution. This corresponds to
clustering three data points, with the number of
clusters known: k = 2. The classical approach is of
course to consider Gaussian i.i.d. data, but
general non-parametric solutions exist not only for i.i.d.
data (Lehmann, 1986), but also for Markov chains
(Gutman, 1989), and under certain mixing rates
conditions. What is important for us here is that the
three-sample problem is easier than the two-sample
problem; the reason is that k is known in the
former case but not in the latter. Indeed, in Ryabko
(2010b) it is shown that in general, for stationary
ergodic (binary-valued) processes, there is no
solution to the two-sample problem, even in the weakest
asymptotic sense. However, a solution to the
three-sample problem, for (real-valued) stationary ergodic
processes, was given in Ryabko & Ryabko (2010).

In this work we demonstrate that, if the number
k of clusters is known, then there is an asymptotically
consistent clustering algorithm, under the only
assumption that the joint distribution of data is
stationary ergodic. If k is unknown, then in this
general case there is no consistent clustering algorithm
(as follows from the mentioned result for the
two-sample problem). However, if an upper bound $\alpha_n$ on
the $\alpha$-mixing rates of the joint distribution of the
processes is known, and $\alpha_n \to 0$, then there is a
consistent clustering algorithm. Both algorithms are
rather simple, and are based on the empirical
estimates of the so-called distributional distance. For
two processes $\rho_1, \rho_2$ a distributional distance d is
defined as $\sum_{k=1}^{\infty} w_k |\rho_1(B_k) - \rho_2(B_k)|$, where $w_k$ are
positive summable real weights, e.g. $w_k = 2^{-k}$, and $B_k$
range over a countable field that generates the
sigma-algebra of the underlying probability space. For
example, if we are talking about finite-alphabet
processes with the binary alphabet $A = \{0, 1\}$, $B_k$ would
range over the set $A^* = \cup_{k \in \mathbb{N}} A^k$; that is, over all
tuples 0, 1, 00, 01, 10, 11, 000, 001, ... (of course, we
could just as well omit, say, 1 and 11); therefore, the
distributional distance in this case is the weighted
sum of differences of probabilities of all possible
tuples. In this work we consider real-valued processes,
so $B_k$ have to range through a suitable sequence of
intervals, all pairs of such intervals, triples, etc. (see the
formal definitions below). This distance has proved
a useful tool for solving various statistical
problems concerning ergodic processes (Ryabko & Ryabko,
2010; Ryabko, 2010a).

Although this distance involves infinite summation,
we show that its empirical approximations can
be easily calculated. For the case of a known number
of clusters, the proposed algorithm (which is shown
to be consistent) is as follows. (The distance in the
algorithms is a suitable empirical estimate of d.) The
first sample is assigned to the first cluster. For each
j = 2..k, find a point that maximizes the minimal
distance to those points already assigned to clusters,
and assign it to the cluster j. Thus we have one
point in each of the k clusters. Next, assign each of
the remaining points to the cluster that contains the
closest point among those k already assigned. For the
case of an unknown number of clusters k, the
algorithm simply puts those samples together that are not
farther away from each other than a certain
threshold level, where the threshold is calculated based on
the known bound on the mixing rates. In this case,
besides the asymptotic result, finite-time bounds on
the probability of outputting an incorrect clustering
can be obtained. Each of the algorithms is shown to
be at most quadratic in each argument.

Therefore, we show that for the proposed notion of
consistency, there are simple algorithms that are
consistent under most general assumptions. While these
algorithms can be easily implemented, we have left
the problem of trying them out on particular
applications, as well as optimizing the parameters, for future
research. It may also be suggested that the
empirical distributional distance can be replaced by other
distances, for which similar theoretical results can be
obtained. An interesting direction, that could
preserve the theoretical generality, would be to use data
compressors. These were used in Ryabko & Astola
(2006) for the related problems of hypothesis
testing, leading both to theoretical and practical results.
As far as clustering is concerned, compression-based
methods were used (without asymptotic consistency
analysis) in Cilibrasi & Vitanyi (2005), and (in a
different way) in Bagnall et al. (2006). Combining our
consistency framework with these compression-based
methods is a promising direction for further research.



2 Preliminaries 

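Let A be an alphabet, and denote $A^*$ the set of tuples
$\cup_{i=1}^{\infty} A^i$. In this work we consider the case $A = \mathbb{R}$;
extensions to the multidimensional case, as well as to
more general spaces, are straightforward.
Distributions, or (stochastic) processes, are measures on the
space $(A^\infty, \mathcal{F}_{A^\infty})$, where $\mathcal{F}_{A^\infty}$ is the Borel
sigma-algebra of $A^\infty$. When talking about joint
distributions of N samples, we mean distributions on the
space $((A^N)^\infty, \mathcal{F}_{(A^N)^\infty})$.

For each $k, l \in \mathbb{N}$, let $\mathcal{B}^{k,l}$ be the partition of the
set $A^k$ into k-dimensional cubes with volume $h_l^k = (1/l)^k$
(the cubes start at 0). Moreover, define
$\mathcal{B}^k = \cup_{l \in \mathbb{N}} \mathcal{B}^{k,l}$ and $\mathcal{B} = \cup_{k=1}^{\infty} \mathcal{B}^k$. The set
$\{B \times A^\infty : B \in \mathcal{B}^{k,l},\ k, l \in \mathbb{N}\}$ generates the Borel
$\sigma$-algebra on $\mathbb{R}^\infty = A^\infty$. For a set $B \in \mathcal{B}$ let $|B|$ be
the index k of the set $\mathcal{B}^k$ that B comes from:
$|B| = k : B \in \mathcal{B}^k$.

We use the abbreviation $X_{1..k}$ for $X_1, \dots, X_k$. For
a sequence $x \in A^n$ and a set $B \in \mathcal{B}$ denote $\nu(x, B)$
the frequency with which the sequence x falls in the
set B:

$$\nu(x, B) := \begin{cases} \dfrac{1}{n - |B| + 1} \sum_{i=1}^{n - |B| + 1} I_{\{(X_i, \dots, X_{i+|B|-1}) \in B\}} & \text{if } n \ge |B|, \\ 0 & \text{otherwise.} \end{cases}$$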
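The frequency $\nu(x, B)$ can be illustrated with a short Python sketch. It is not part of the original text: the function name `cell_frequencies` and the representation of a cube by the tuple of integer indices `floor(v * l)` of its lower corner are assumptions made here for illustration, and the sketch assumes non-negative real values so that the cubes "start at 0".

```python
from collections import Counter
from math import floor


def cell_frequencies(x, m, l):
    """Return {cell: nu(x, cell)} for the non-empty cells of the partition
    of A^m into cubes of side 1/l (hypothetical helper, not from the paper).

    A cell is identified by the integer indices of its lower corner.
    """
    n = len(x)
    if n < m:
        # nu(x, B) is defined as 0 when the sample is shorter than |B|.
        return {}
    windows = n - m + 1  # number of length-m tuples in x
    counts = Counter(
        tuple(floor(v * l) for v in x[i:i + m]) for i in range(windows)
    )
    return {cell: c / windows for cell, c in counts.items()}


if __name__ == "__main__":
    sample = [0.1, 0.6, 0.1, 0.6]
    print(cell_frequencies(sample, 1, 2))
    print(cell_frequencies(sample, 2, 2))
```

Only cells that actually occur in the sample are stored, which is what keeps the number of non-zero terms linear in the sample length.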
A process $\rho$ is stationary if $\rho(X_{1..|B|} = B) =
\rho(X_{t..t+|B|-1} = B)$ for any $B \in A^*$ and $t \in \mathbb{N}$. We
further abbreviate $\rho(B) := \rho(X_{1..|B|} = B)$. A
stationary process $\rho$ is called (stationary) ergodic if the
frequency of occurrence of each word B in a sequence
$X_1, X_2, \dots$ generated by $\rho$ tends to its a priori (or
limiting) probability a.s.:
$\rho(\lim_{n \to \infty} \nu(X_{1..n}, B) = \rho(B)) = 1$.
Denote $\mathcal{E}$ the set of all stationary ergodic
processes.

Definition 1 (distributional distance). The
distributional distance is defined for a pair of processes $\rho_1, \rho_2$
as follows (e.g. Gray (1988)):

$$d(\rho_1, \rho_2) = \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in \mathcal{B}^{m,l}} |\rho_1(B) - \rho_2(B)|,$$

where $w_j = 2^{-j}$.



(The weights in the definition are fixed for the
sake of concreteness only; we could take any other
summable sequence of positive weights instead.) In
words, we are taking a sum over a series of partitions
into cubes of decreasing volume (indexed by l) of all
sets $A^k$, $k \in \mathbb{N}$, and count the differences in
probabilities of all cubes in all these partitions. These
differences in probabilities are weighted: smaller weights
are given to larger k and finer partitions. It is easy
to see that d is a metric. We refer to Gray (1988) for
more information on this metric and its properties.

The clustering algorithms presented below are
based on empirical estimates of the distance d:

$$\hat d(X^1_{1..n_1}, X^2_{1..n_2}) = \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in \mathcal{B}^{m,l}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)|, \qquad (1)$$

where $n_1, n_2 \in \mathbb{N}$ and $X^i_{1..n_i} \in A^{n_i}$.

Although the expression (1) involves taking three
infinite sums, it will be shown below that it can be
easily calculated.

Lemma 1 ($\hat d$ is consistent). Let $\rho_1, \rho_2 \in \mathcal{E}$ and let
two samples $x_1 = X^1_{1..n_1}$ and $x_2 = X^2_{1..n_2}$ be
generated by a distribution $\rho$ such that the marginal
distribution of $X^i_{1..n_i}$ is $\rho_i$, i = 1, 2, and the joint
distribution $\rho$ is stationary ergodic. Then

$$\lim_{n_1, n_2 \to \infty} \hat d(X^1_{1..n_1}, X^2_{1..n_2}) = d(\rho_1, \rho_2) \quad \rho\text{-a.s.}$$

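Proof. The idea of the proof is simple: for each set
$B \in \mathcal{B}$, the frequency with which the sample $x_1$ falls
into B converges to the probability $\rho_1(B)$, and
analogously for the second sample. When the sample sizes
grow, there will be more and more sets $B \in \mathcal{B}$ whose
frequencies have already converged to the
probabilities, so that the cumulative weight of those sets whose
frequencies have not converged yet will tend to 0.

For any $\varepsilon > 0$ we can find an index J such
that $\sum_{i,j=J}^{\infty} w_i w_j < \varepsilon/3$. Moreover, for each m, l
we can find such elements $B^{m,l}_1, \dots, B^{m,l}_{t_{m,l}}$, for some
$t_{m,l} \in \mathbb{N}$, of the partition $\mathcal{B}^{m,l}$ that
$\rho_k(\cup_{i=1}^{t_{m,l}} B^{m,l}_i) > 1 - \varepsilon/(6 J w_m w_l)$ for k = 1, 2.
For each $B^{m,l}_j$, where $m, l \le J$ and
$j \le t_{m,l}$, we have $\nu((X^1_1, \dots, X^1_{n_1}), B^{m,l}_j) \to \rho_1(B^{m,l}_j)$
a.s., so that

$$|\nu((X^1_1, \dots, X^1_{n_1}), B^{m,l}_j) - \rho_1(B^{m,l}_j)| < \rho_1(B^{m,l}_j)\, \varepsilon/(6 J w_m w_l)$$

for all $n_1 \ge u$, for some $u \in \mathbb{N}$; define $U^{m,l}_j := u$.
Let $U := \max_{m,l \le J,\ j \le t_{m,l}} U^{m,l}_j$ (U depends on the
realization $X^1_1, X^1_2, \dots$). Define analogously V for
the sequence $(X^2_1, X^2_2, \dots)$. Thus for $n_1 > U$ and
$n_2 > V$ we have

$$\begin{aligned}
|\hat d(x_1, x_2) - d(\rho_1, \rho_2)|
&= \Big| \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in \mathcal{B}^{m,l}} \big( |\nu(x_1, B) - \nu(x_2, B)| - |\rho_1(B) - \rho_2(B)| \big) \Big| \\
&\le \sum_{m,l=1}^{\infty} w_m w_l \sum_{B \in \mathcal{B}^{m,l}} \big( |\nu(x_1, B) - \rho_1(B)| + |\nu(x_2, B) - \rho_2(B)| \big) \\
&\le \sum_{m,l=1}^{J} w_m w_l \sum_{i=1}^{t_{m,l}} \big( |\nu(x_1, B^{m,l}_i) - \rho_1(B^{m,l}_i)| + |\nu(x_2, B^{m,l}_i) - \rho_2(B^{m,l}_i)| \big) + 2\varepsilon/3 \\
&\le \sum_{m,l=1}^{J} w_m w_l \sum_{i=1}^{t_{m,l}} \big( \rho_1(B^{m,l}_i)\, \varepsilon/(6 J w_m w_l) + \rho_2(B^{m,l}_i)\, \varepsilon/(6 J w_m w_l) \big) + 2\varepsilon/3 \le \varepsilon,
\end{aligned}$$

which proves the statement. □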
3 Main results

The clustering problem can be defined as follows. We
are given N samples $x_1, \dots, x_N$, where each
sample $x_i$ is a string of length $n_i$ of symbols from A:
$x_i = X^i_{1..n_i}$. Each sample is generated by one out
of k different unknown stationary ergodic
distributions $\rho_1, \dots, \rho_k \in \mathcal{E}$. Thus, there is a partitioning
$I = \{I_1, \dots, I_k\}$ of the set {1..N} into k disjoint
subsets $I_j$, j = 1..k,

$$\{1..N\} = \cup_{j=1}^{k} I_j,$$

such that $x_i$, $1 \le i \le N$, is generated by $\rho_j$ if and
only if $i \in I_j$. The partitioning I is called the
target clustering and the sets $I_i$, $1 \le i \le k$, are called
the target clusters. Given samples $x_1, \dots, x_N$ and a
target clustering I, let I(x) denote the cluster that
contains x.

A clustering function F takes a finite number of
samples $x_1, \dots, x_N$ and an optional parameter k (the
target number of clusters) and outputs a partition
$F(x_1, \dots, x_N, (k)) = \{T_1, \dots, T_k\}$ of the set {1..N}.

Definition 2 (asymptotic consistency). Let a finite
number N of samples be given, and let the target
clustering partition be I. Define $n = \min\{n_1, \dots, n_N\}$. A
clustering function F is strongly asymptotically
consistent if $F(x_1, \dots, x_N, (k)) = I$ from some n on with
probability 1. A clustering function is weakly
asymptotically consistent if $P(F(x_1, \dots, x_N, (k)) = I) \to 1$.

Note that the consistency is asymptotic with re- 
spect to the minimal length of the sample, and not 
with respect to the number of samples. 

3.1 Known number of clusters 

Algorithm 1 is a simple clustering algorithm, which,
given the number k of clusters, will be shown to be
consistent under most general assumptions. It works
as follows. The point $x_1$ is assigned to the first
cluster. Next, find the point that is farthest away from $x_1$
in the empirical distributional distance $\hat d$, and assign
this point to the second cluster. For each j = 3..k,
find a point that maximizes the minimal distance to
those points already assigned to clusters, and assign
it to the cluster j. Thus we have one point in each
of the k clusters. Next simply assign each of the
remaining points to the cluster that contains the closest
point among those k already assigned. (One may
notice that Algorithm 1 is one iteration of the k-means
algorithm, with a specific initialization, and a
specially designed distance.)

Algorithm 1 The case of known number of clusters k
INPUT: The number of clusters k, samples
$x_1, \dots, x_N$.

Initialize: j := 1, $c_1 := 1$, $T_1 := \{x_{c_1}\}$.
for j := 2 to k do
    $c_j := \operatorname{argmax}_{i = 1, \dots, N} \min_{t=1}^{j-1} \hat d(x_i, x_{c_t})$
    $T_j := \{x_{c_j}\}$
end for
for i = 1 to N do
    Put $x_i$ into the set $T_{\operatorname{argmin}_{j=1}^{k} \hat d(x_i, x_{c_j})}$
end for
OUTPUT: the sets $T_j$, j = 1..k.
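The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the author's implementation: the function name `cluster_known_k` and the `dist` argument are hypothetical, and any symmetric non-negative function on samples can be plugged in for the empirical distributional distance. The toy distance in the usage example (difference of sample means) is only a stand-in and has none of the consistency properties discussed in the text.

```python
def cluster_known_k(samples, k, dist):
    """Farthest-point initialization followed by nearest-centre assignment.

    Returns k lists of sample indices (0-based). `dist` is a placeholder
    for an empirical estimate of the distributional distance.
    """
    n = len(samples)
    centres = [0]  # the first sample seeds the first cluster
    # to_centre[t][i] caches dist(samples[i], samples[centres[t]]),
    # so that at most k * n distances are ever computed.
    to_centre = [[dist(samples[i], samples[0]) for i in range(n)]]
    for _ in range(1, k):
        # Pick the sample whose minimal distance to the chosen centres is largest.
        nxt = max(range(n), key=lambda i: min(row[i] for row in to_centre))
        centres.append(nxt)
        to_centre.append([dist(samples[i], samples[nxt]) for i in range(n)])
    clusters = [[] for _ in range(k)]
    for i in range(n):
        nearest = min(range(k), key=lambda t: to_centre[t][i])
        clusters[nearest].append(i)
    return clusters


if __name__ == "__main__":
    def toy_dist(a, b):
        # Stand-in distance for the demo only: difference of sample means.
        return abs(sum(a) / len(a) - sum(b) / len(b))

    data = [[0.1, 0.1], [0.9, 0.9], [0.12, 0.1], [0.88, 0.9]]
    print(cluster_known_k(data, 2, toy_dist))
```

Caching the distances to each chosen centre mirrors the counting argument used for the complexity bound: every iteration adds one new row of N distances.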



Proposition 1 (calculating $\hat d(x_1, x_2)$). For two
samples $x_1 = X^1_{1..n_1}$ and $x_2 = X^2_{1..n_2}$ the
computational complexity (time and space) of calculating
the empirical distributional distance $\hat d(x_1, x_2)$ (1) is
$O(n^2 \log s_{\min}^{-1})$, where $n = \max(n_1, n_2)$ and

$$s_{\min} = \min_{i = 1..n_1,\ j = 1..n_2,\ X^1_i \ne X^2_j} |X^1_i - X^2_j|.$$

Proof. First, observe that for fixed m and l the sum

$$T^{m,l} := \sum_{B \in \mathcal{B}^{m,l}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)| \qquad (2)$$

has not more than $n_1 + n_2 - 2m + 2$ non-zero
terms (assuming $m \le n_1, n_2$; the other case is
obvious). Indeed, for each i = 1, 2, in the
sample $x_i$ there are $n_i - m + 1$ tuples of size
m: $X^i_{1..m}, X^i_{2..m+1}, \dots, X^i_{n_i - m + 1..n_i}$. Therefore, the
complexity of calculating $T^{m,l}$ is
$O(n_1 + n_2 - 2m + 2) = O(n)$. Furthermore, observe that for each m, for
all $l > \log s_{\min}^{-1}$ the term $T^{m,l}$ is constant. Therefore,
it is enough to calculate $T^{m,1}, \dots, T^{m, \log s_{\min}^{-1}}$, since
for fixed m

$$\sum_{l=1}^{\infty} w_m w_l T^{m,l} = w_m w_{\log s_{\min}^{-1}} T^{m, \log s_{\min}^{-1}} + \sum_{l=1}^{\log s_{\min}^{-1}} w_m w_l T^{m,l}$$

(that is, we double the weight of the last
non-zero term). Thus, the complexity of calculating
$\sum_{l=1}^{\infty} w_m w_l T^{m,l}$ is $O(n \log s_{\min}^{-1})$. Finally, for all
$m > n$ we have $T^{m,l} = 0$. Since
$\hat d(x_1, x_2) = \sum_{m,l=1}^{\infty} w_m w_l T^{m,l}$, the statement is proven. □
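The remark about doubling the weight of the last term can be checked directly. The short derivation below is a sketch under two assumptions that are made explicit here: the weights are $w_l = 2^{-l}$ as in Definition 1, and L is used as shorthand (not from the original text) for an index such that $T^{m,l} = T^{m,L}$ for all $l \ge L$.

```latex
\sum_{l=L+1}^{\infty} w_l \;=\; \sum_{l=L+1}^{\infty} 2^{-l} \;=\; 2^{-L} \;=\; w_L,
\qquad\text{hence}\qquad
\sum_{l=1}^{\infty} w_l\, T^{m,l}
\;=\; \sum_{l=1}^{L} w_l\, T^{m,l} \;+\; T^{m,L} \sum_{l=L+1}^{\infty} w_l
\;=\; \sum_{l=1}^{L} w_l\, T^{m,l} \;+\; w_L\, T^{m,L}.
```

With a different summable weight sequence the tail sum would no longer equal $w_L$, and the correction term would have to be replaced by the corresponding tail weight.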

Theorem 1. Let $N \in \mathbb{N}$ and suppose that the
samples $x_1, \dots, x_N$ are generated in such a way that the
joint distribution is stationary ergodic. If the
correct number of clusters k is known, then Algorithm 1
is strongly asymptotically consistent. Algorithm 1
makes O(kN) calculations of $\hat d(\cdot, \cdot)$, so that its
computational complexity is $O(kN n_{\max}^2 \log s_{\min}^{-1})$, where
$n_{\max} = \max_{i=1}^{N} n_i$ and

$$s_{\min} = \min_{u, v = 1..N,\ u \ne v,\ i = 1..n_u,\ j = 1..n_v,\ X^u_i \ne X^v_j} |X^u_i - X^v_j|.$$

Observe that the samples are not required to be 
generated independently. The only requirement on 
the distribution of samples is that the joint distribu- 
tion is stationary ergodic. This is perhaps one of the 
mildest possible probabilistic assumptions. 

Proof. By Lemma 1, $\hat d(x_i, x_j)$, $i, j \in \{1..N\}$,
converges to 0 if and only if $x_i$ and $x_j$ are in the same
cluster. Since there are only finitely many samples $x_i$,
there exists some $\delta > 0$ such that, from some n on,
we will have $\hat d(x_i, x_j) < \delta$ if $x_i, x_j$ belong to the same
target cluster ($I(x_i) = I(x_j)$), and $\hat d(x_i, x_j) > \delta$
otherwise ($I(x_i) \ne I(x_j)$). Therefore, from some n on,
for every $j \le k$ we will have
$\max_{i = 1, \dots, N} \min_{t=1}^{j-1} \hat d(x_i, x_{c_t}) > \delta$ and the sample $x_{c_j}$, where
$c_j = \operatorname{argmax}_{i = 1, \dots, N} \min_{t=1}^{j-1} \hat d(x_i, x_{c_t})$, will
be selected from a target cluster that does not contain
any $x_{c_i}$, $i < j$. The consistency statement follows.

Next, let us find how many pairwise distance
estimates $\hat d(x_i, x_j)$ the algorithm has to make. On the
first iteration of the loop, it has to calculate $\hat d(x_i, x_{c_1})$
for all i = 1..N. On the second iteration, it needs
again $\hat d(x_i, x_{c_1})$ for all i = 1..N, which are already
calculated, and also $\hat d(x_i, x_{c_2})$ for all i = 1..N, and
so on: on the jth iteration of the loop we need to
calculate $\hat d(x_i, x_{c_j})$, i = 1..N, which gives at most kN
pairwise distance calculations in total. The
statement about computational complexity follows from
this and Proposition 1; indeed, apart from the
calculation of $\hat d$, the rest of the computations is of order
O(kN). □

Complexity-precision trade-off. The bound on
the computational complexity of Algorithm 1 given
in Theorem 1 is given for the case of precisely
calculated distance estimates $\hat d(\cdot, \cdot)$. However, precise
estimates are not needed if we only want to have an
asymptotically consistent algorithm. Indeed,
following the proof of Lemma 1 it is easy to check that if
we replace in (1) the infinite sums with sums over any
number of terms $m_n$, $l_n$ that grows to infinity with
$n = \min(n_1, n_2)$, and if we replace partitions $\mathcal{B}^{m,l}$ by
their (finite) subsets $\mathcal{B}^{m,l,n}$ which increase to $\mathcal{B}^{m,l}$,
then we still have a consistent estimate of $d(\cdot, \cdot)$.

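Definition 3 ($\check d$). Let $m_n$, $l_n$ be some sequences of
numbers, $\mathcal{B}^{m,l,n} \subset \mathcal{B}^{m,l}$ for all $m, l, n \in \mathbb{N}$, and
denote $n := \min\{n_1, n_2\}$. Define

$$\check d(X^1_{1..n_1}, X^2_{1..n_2}) := \sum_{m=1}^{m_n} \sum_{l=1}^{l_n} w_m w_l \sum_{B \in \mathcal{B}^{m,l,n}} |\nu(X^1_{1..n_1}, B) - \nu(X^2_{1..n_2}, B)|. \qquad (3)$$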
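A truncated estimate of this kind is straightforward to compute. The following Python sketch is an illustration under stated assumptions, not the author's code: the names `cell_frequencies` and `truncated_distance` are hypothetical, the truncation levels are passed as `m_max` and `l_max`, the full partitions are used (that is, $\mathcal{B}^{m,l,n} = \mathcal{B}^{m,l}$), the weights are $w_j = 2^{-j}$, and samples are assumed to take non-negative real values.

```python
from collections import Counter
from math import floor


def cell_frequencies(x, m, l):
    """Frequencies of the non-empty cubes of side 1/l hit by length-m windows of x."""
    n = len(x)
    if n < m:
        return {}  # nu is 0 on every cell when the sample is too short
    windows = n - m + 1
    counts = Counter(
        tuple(floor(v * l) for v in x[i:i + m]) for i in range(windows)
    )
    return {cell: c / windows for cell, c in counts.items()}


def truncated_distance(x1, x2, m_max, l_max):
    """Sum of w_m * w_l * T^{m,l} over m <= m_max and l <= l_max, with w_j = 2**-j."""
    total = 0.0
    for m in range(1, m_max + 1):
        for l in range(1, l_max + 1):
            f1 = cell_frequencies(x1, m, l)
            f2 = cell_frequencies(x2, m, l)
            # Only cells visited by at least one sample contribute.
            t_ml = sum(
                abs(f1.get(cell, 0.0) - f2.get(cell, 0.0))
                for cell in f1.keys() | f2.keys()
            )
            total += 2.0 ** (-m) * 2.0 ** (-l) * t_ml
    return total


if __name__ == "__main__":
    a = [0.1, 0.1, 0.1, 0.1]
    b = [0.9, 0.9, 0.9, 0.9]
    print(truncated_distance(a, a, 2, 3))
    print(truncated_distance(a, b, 1, 2))
```

Larger values of `m_max` and `l_max` trade computation for a finer comparison of the two samples, which is the trade-off the surrounding discussion refers to.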



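Lemma 2 ($\check d$ is consistent). Assume the conditions of
Lemma 1. Let $l_n$ and $m_n$ be any sequences of integers
that go to infinity with n, and let, for each $m, l \in \mathbb{N}$,
the sets $\mathcal{B}^{m,l,n}$, $n \in \mathbb{N}$, be an increasing sequence of
subsets of $\mathcal{B}^{m,l}$, such that $\cup_{n \in \mathbb{N}} \mathcal{B}^{m,l,n} = \mathcal{B}^{m,l}$. Then

$$\lim_{n_1, n_2 \to \infty} \check d(X^1_{1..n_1}, X^2_{1..n_2}) = d(\rho_1, \rho_2) \quad \rho\text{-a.s.}$$

Proof. It is enough to observe that

$$\lim_{n_1, n_2 \to \infty} \sum_{m=1}^{m_n} \sum_{l=1}^{l_n} w_m w_l \sum_{B \in \mathcal{B}^{m,l,n}} |\rho_1(B) - \rho_2(B)| = d(\rho_1, \rho_2),$$

and then follow the proof of Lemma 1. □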

If we use the estimate $\check d(\cdot, \cdot)$ in Algorithm 1 (instead
of $\hat d(\cdot, \cdot)$), then we still get an asymptotically
consistent clustering function. Thus the following
statement holds true.
Proposition 2. Assume the conditions of
Theorem 1. For all sequences $m_n$, $l_n$ of numbers that
increase to infinity with n, there is a strongly
asymptotically consistent clustering algorithm, whose
computational complexity is $O(kN n_{\max} m_{n_{\max}} l_{n_{\max}})$.



On the one hand, Proposition 2 can be thought of
as an artifact of the asymptotic definition of
consistency; on the other hand, in practice precise
calculation of $\hat d(\cdot, \cdot)$ is hardly necessary. What we get from
Proposition 2 is the possibility to select the
appropriate trade-off between the computational burden
and the precision of clustering before asymptotic.

Note that the bound in Proposition 2 does not
involve the sizes of the sets $\mathcal{B}^{m,l,n}$; in particular, one
can take $\mathcal{B}^{m,l,n} = \mathcal{B}^{m,l}$ for all n. This is because,
for every two samples $X^1_{1..n_1}$ and $X^2_{1..n_2}$, this sum has
no more than 2n non-zero terms, whatever are m, l.
However, in the following section, where we are
after clustering with an unknown number of clusters
k, and thus after controlled rates of convergence, the
sizes of the sets $\mathcal{B}^{m,l,n}$ will appear in the bounds.

3.2 Unknown number of clusters 

So far we have shown that when the number of
clusters is known in advance, consistent clustering is
possible under the only assumption that the joint
distribution of the samples is stationary ergodic.
However, under this assumption, in general, consistent
clustering with unknown number of clusters is
impossible. Indeed, as was shown in Ryabko (2010b),
when we have only two binary-valued samples,
generated independently by two stationary ergodic
distributions, it is impossible to decide whether they have
been generated by the same or by different
distributions, even in the sense of weak asymptotic
consistency (this holds even if the distributions come from
a smaller class: the set of all B-processes).
Therefore, if the number of clusters is unknown, we have
to settle for less, which means that we have to make
stronger assumptions on the data. What we need is
known rates of convergence of frequencies to their
expectations. Such rates are provided by assumptions
on the mixing rates of the distribution generating the
data. Here we will show that under rather mild
assumptions on the mixing rates (and, again, without
any modeling assumptions or assumptions of
independence), consistent clustering is possible when the
number of clusters is unknown.

In this section we assume that all the samples
are [0, 1]-valued (that is, $X^i_j \in [0, 1]$); extension
to arbitrary bounded (multidimensional) ranges is
straightforward. Next we introduce mixing
coefficients, mainly following Bosq (1996) in formulations.
Informally, mixing coefficients of a stochastic process
measure how fast the process forgets about its past.
Any one-way infinite stationary process $X_1, X_2, \dots$
can be extended backwards to make a two-way
infinite process $\dots, X_{-1}, X_0, X_1, \dots$ with the same
distribution. In the definition below we assume such an
extension. Define the $\alpha$-mixing coefficients as

$$\alpha(n) = \sup_{A \in \sigma(\dots, X_{-1}, X_0),\ B \in \sigma(X_n, X_{n+1}, \dots)} |P(A \cap B) - P(A) P(B)|, \qquad (4)$$

where $\sigma(..)$ stands for the sigma-algebra generated by
the random variables in brackets. These coefficients are
non-increasing. A process is called strongly $\alpha$-mixing
if $\alpha(n) \to 0$. Many important classes of processes
satisfy the mixing conditions. For example, if a process
is a stationary irreducible aperiodic Hidden Markov
process, then it is $\alpha$-mixing. If the underlying Markov
chain is finite-state, then the coefficients decrease
exponentially fast. Other probabilistic assumptions can
be used to obtain bounds on the mixing coefficients,
see e.g. Bradley (2005) and references therein.

Algorithm 2 is very simple. Its inputs are:
samples $x_1, \dots, x_N$, the threshold level $\delta \in (0, 1)$, the
parameters $m, l \in \mathbb{N}$, $\mathcal{B}^{m,l,n}$. The algorithm assigns
to the same cluster all samples which are at most
$\delta$-far from each other, as measured by $\check d(\cdot, \cdot)$. The
estimate $\check d(\cdot, \cdot)$ can be calculated in the same way as
$\hat d(\cdot, \cdot)$ (see Proposition 1 and its proof). We do not
give a pseudocode implementation of this algorithm,
since it is rather obvious.

The idea is that the threshold level $\delta$ is selected
according to the minimal length of a sample and the
(known bounds on) mixing rates of the process $\rho$
generating the samples (see Theorem 2).
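One possible reading of Algorithm 2 is sketched below in Python. This is an interpretation, not a specification from the text: samples whose distance is at most the threshold are linked, and the connected components of the resulting graph are returned as clusters. When every within-cluster distance is below the threshold and every between-cluster distance is above it, the components coincide with the target clusters. The names `cluster_by_threshold` and `dist` are hypothetical, and the toy distance in the usage example is a stand-in for the estimate of the distributional distance.

```python
def cluster_by_threshold(samples, delta, dist):
    """Group sample indices by linking every pair with dist <= delta.

    Returns the connected components as sorted lists of 0-based indices.
    `dist` is a placeholder for an empirical distance between samples.
    """
    n = len(samples)
    parent = list(range(n))  # union-find forest over sample indices

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist(samples[i], samples[j]) <= delta:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    def toy_dist(a, b):
        # Stand-in distance for the demo only: difference of sample means.
        return abs(sum(a) / len(a) - sum(b) / len(b))

    data = [[0.1, 0.1], [0.9, 0.9], [0.12, 0.1], [0.88, 0.9]]
    print(cluster_by_threshold(data, 0.1, toy_dist))
```

The sketch evaluates all N(N-1)/2 pairwise distances, in line with the quadratic dependence on N stated for Algorithm 2; the number of clusters is an output rather than an input.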

The next theorem shows that, if the joint
distribution of the samples satisfies $\alpha(n) \le \alpha_n \to 0$, where
$\alpha_n$ are known, then one can select (based on $\alpha_n$ only)
the parameters of Algorithm 2 in such a way that
it is weakly asymptotically consistent. Moreover, a
bound on the probability of error before asymptotic
is provided.

Theorem 2 (Algorithm 2 is consistent, unknown k).
Fix sequences $\alpha_n \in (0, 1)$, $m_n, l_n, b_n \in \mathbb{N}$, and let
$\mathcal{B}^{m,l,n} \subset \mathcal{B}^{m,l}$ be an increasing sequence of finite sets,
for each $m, l \in \mathbb{N}$. Set $b_n := \max_{l \le l_n,\ m \le m_n} |\mathcal{B}^{m,l,n}|$.
Let also $\delta_n \in (0, 1)$. Let $N \in \mathbb{N}$ and suppose that
the samples $x_1, \dots, x_N$ are generated in such a way
that the (unknown) joint distribution $\rho$ is stationary
ergodic, and satisfies $\alpha_n(\rho) \le \alpha_n$, for all $n \in \mathbb{N}$.
Then for every sequence $q_n \in [1..n/2]$, Algorithm 2,
with the above parameters, satisfies

$$\rho(T \ne I) \le 2N(N+1)\big(m_n l_n b_n \gamma_n(\delta_n) + \gamma_n(\varepsilon_\rho)\big) \qquad (5)$$

where

$$\gamma_n(\delta) := 2 e^{-q_n \delta^2 / 32} + 11 (1 + 4/\delta)^{1/2} q_n \alpha_{(n - 2m_n)/2q_n},$$

T is the partition output by the algorithm, I is the
target clustering, $\varepsilon_\rho$ is a constant that depends only
on $\rho$, and $n = \min_{i = 1..N} n_i$.

In particular, if $\alpha_n = o(1)$, then, selecting
the parameters in such a way that $\delta_n = o(1)$,
$q_n, m_n, l_n, b_n = o(n)$, $q_n, m_n, l_n \to \infty$,
$\cup_{k \in \mathbb{N}} \mathcal{B}^{m,l,k} = \mathcal{B}^{m,l}$, $b_n^{m,l} \to \infty$, for all $m, l \in \mathbb{N}$, and, finally,

$$m_n l_n b_n \big(e^{-q_n \delta_n^2} + \delta_n^{-1/2} q_n \alpha_{(n - 2m_n)/2q_n}\big) = o(1),$$

as is always possible, Algorithm 2 is weakly
asymptotically consistent (with the number of clusters k
unknown). The computational complexity of
Algorithm 2 is $O(N^2 n_{\max} m_{n_{\max}} l_{n_{\max}})$, and is bounded by
$O(N^2 n_{\max}^2 \log s_{\min}^{-1})$, where $n_{\max}$ and $s_{\min}$ are
defined as in Theorem 1.



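Proof. We use the following bound from Bosq (1996):
for any zero-mean random process $Y_1, Y_2, \dots$, every
$n \in \mathbb{N}$ and every $q \in [1..n/2]$ we have

$$P\Big(\Big|\sum_{i=1}^{n} Y_i\Big| > n\varepsilon\Big) \le 4 \exp(-q \varepsilon^2 / 8) + 22 (1 + 4/\varepsilon)^{1/2} q\, \alpha(n/2q).$$

For every j = 1..N, every $m \le n$, $l \in \mathbb{N}$, and
$B \in \mathcal{B}^{m,l}$, define the processes $Y^j_1, Y^j_2, \dots$, where

$$Y^j_t := I_{\{(X^j_t, \dots, X^j_{t+m-1}) \in B\}} - \rho(X^j_{1..m} \in B).$$

It is easy to see that $\alpha$-mixing coefficients for this
process satisfy $\alpha(n) \le \alpha_{n - 2m}$. Thus,

$$\rho\big(|\nu(X^j_{1..n_j}, B) - \rho(X^j_{1..m} \in B)| > \varepsilon/2\big) \le \gamma_n(\varepsilon). \qquad (6)$$

Then for every $i, j \in [1..N]$ such that $I(x_i) = I(x_j)$
(that is, $x_i$ and $x_j$ are in the same cluster) we have

$$\rho\big(|\nu(X^i_{1..n_i}, B) - \nu(X^j_{1..n_j}, B)| > \varepsilon\big) \le 2\gamma_n(\varepsilon).$$

Using the union bound, summing over m, l, and B,
we obtain

$$\rho\big(\check d(x_i, x_j) > \varepsilon\big) \le 2 m_n l_n b_n \gamma_n(\varepsilon). \qquad (7)$$

Next, let i, j be such that $I(x_i) \ne I(x_j)$. Then, for
some $m_{i,j}, l_{i,j} \in \mathbb{N}$ there is $B_{i,j} \in \mathcal{B}^{m_{i,j}, l_{i,j}}$ such that
$|\rho(X^i_{1..|B_{i,j}|} \in B_{i,j}) - \rho(X^j_{1..|B_{i,j}|} \in B_{i,j})| > 2\tau_{i,j}$, for
some $\tau_{i,j} > 0$. Then for every $\varepsilon < \tau_{i,j}/2$ we have

$$\begin{aligned}
\rho\big(|\nu(X^i_{1..n_i}, B_{i,j}) - \nu(X^j_{1..n_j}, B_{i,j})| < \varepsilon\big)
&\le \rho\big(|\nu(X^i_{1..n_i}, B_{i,j}) - \rho(X^i_{1..|B_{i,j}|} \in B_{i,j})| > \tau_{i,j}\big) \\
&\quad + \rho\big(|\nu(X^j_{1..n_j}, B_{i,j}) - \rho(X^j_{1..|B_{i,j}|} \in B_{i,j})| > \tau_{i,j}\big) \\
&\le 2\gamma_n(\tau_{i,j}). \qquad (8)
\end{aligned}$$

Moreover, for $\varepsilon < w_{m_{i,j}} w_{l_{i,j}} \tau_{i,j}/2$

$$\rho\big(\check d(x_i, x_j) < \varepsilon\big) \le 2\gamma_n(w_{m_{i,j}} w_{l_{i,j}} \tau_{i,j}). \qquad (9)$$

Define $\varepsilon_\rho := \min_{i,j = 1..N:\ I(x_i) \ne I(x_j)} w_{m_{i,j}} w_{l_{i,j}} \tau_{i,j}/2$.
Clearly, from this and (9), for every $\varepsilon < 2\varepsilon_\rho$ we obtain

$$\rho\big(\check d(x_i, x_j) < \varepsilon\big) \le 2\gamma_n(\varepsilon_\rho). \qquad (10)$$

If, for every pair i, j of samples, $\check d(x_i, x_j) < \delta_n$ if
and only if $I(x_i) = I(x_j)$, then Algorithm 2 gives
a correct answer. Therefore, taking the bounds (7)
and (10) together for each of the N(N + 1)/2 pairs
of samples, we obtain (5). The complexity statement
can be established analogously to that in Theorem 1.
□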
While Theorem 2 shows that $\alpha$-mixing with a
known bound on the coefficients is sufficient to
achieve asymptotic consistency, the bound (5) on the
probability of error includes as multiplicative terms
all the parameters $m_n$, $l_n$ and $b_n$ of the algorithm,
which can make it large for practically useful choices
of the parameters. The multiplicative factors are due
to the fact that we take a bound on the divergence
of each individual frequency of each cell of each
partition from its expectation, and then take a union
bound over all of these. To obtain a more
realistic performance guarantee, we would like to have
a bound on the divergence of all the frequencies of
all cells of a given partition from their expectations.
Such uniform divergence estimates are possible
under stronger assumptions; namely, they can be
established under some assumptions on $\beta$-mixing
coefficients, which are defined as follows:

$$\beta(n) = \mathbf{E} \sup_{B \in \sigma(X_n, X_{n+1}, \dots)} |P(B) - P(B \mid \sigma(\dots, X_{-1}, X_0))|.$$

These coefficients satisfy $2\alpha(n) \le \beta(n)$ (see e.g. Bosq
(1996)), so assumptions on the speed of decrease of
$\beta$-coefficients are stronger. Using the uniform bounds
given in Karandikar & Vidyasagar (2002), one can
obtain a statement similar to that in Theorem 2, with
$\alpha$-mixing replaced by $\beta$-mixing, and without the
multiplicative factor $b_n$.

4 Conclusion 

We have proposed a framework for defining consis- 
tency of clustering algorithms, when the data comes 
as a set of samples drawn from stationary processes. 
The main advantage of this framework is its general- 
ity: no assumptions have to be made on the distribu- 
tion of the data, beyond stationarity and ergodicity. 
The proposed notion of consistency is so simple and
natural, that it may be suggested to be used as a
basic sanity-check for all clustering algorithms that are
used on sequence-like data. For example, it is easy
to see that the k-means algorithm will be consistent
with some initializations (e.g. with the one used in
Algorithm 1) but not with others (e.g. not with the
random one).



While the algorithms that we presented to
demonstrate the existence of consistent clustering
methods are computationally efficient and easy to
implement, the main value of the established results is
theoretical. As mentioned in the introduction,
it can be suggested that for practical
applications empirical estimates of the distributional
distance can be replaced with distances based on data
compression, in the spirit of Ryabko & Astola (2006);
Cilibrasi & Vitanyi (2005); Ryabko (2009).

Another direction for future research concerns
optimal bounds on the speed of convergence: while we
show that such bounds can be obtained (of course,
only in the case of known mixing rates), finding
practical and tight bounds, for different notions of mixing
rates, remains open.

Finally, here we have only considered the setting
in which the number N of samples is fixed, while
the asymptotic is with respect to the lengths of the
samples. For on-line clustering problems, it would be
interesting to consider the formulation where both N
and the lengths of the samples grow.